Models for proportion (part to whole percent) data: Part II

Data From Fig 2d – Inhibition of fatty acid oxidation enables heart regeneration in adult mice

The researchers present two tests/plots: the effect of genotype on fibrotic area and the effect on the percent of fibrotic area, relative to total heart area. A GLM with offset combines these into a single analysis/plot and has more power than the linear model/t-test or Mann-Whitney U.
percents
proportions
generalized linear model
offset
power
simulation
Author

Jeff Walker

Published

July 11, 2024

Modified

October 2, 2024

Better than Reproducibility of Fig 5c. The CIs and p-values are from a quasipoisson GLM model with offset

Vital info

Data From: Li, X., Wu, F., Günther, S., Looso, M., Kuenne, C., Zhang, T., … & Braun, T. (2023). Inhibition of fatty acid oxidation enables heart regeneration in adult mice. Nature, 622(7983), 619-626.

Fig: 2d download data

key words:

Published methods: t-test on both raw and proportion data.

Design: Completely Randomized Design (CRD)

Response: fibrotic_area (the area of the injured tissue in the heart), both raw and as a proportion of total heart area.

Key learning concepts: proportions and offsets

More info: 18.4 Example 2 – Use a GLM with an offset instead of a ratio of some measurement per total (“dna damage” data fig3b)

The Experiment

Background: Postnatal maturation of cardiomyocytes is characterized by a metabolic switch from glycolysis to fatty acid oxidation, chromatin reconfiguration and exit from the cell cycle, instating a barrier for adult heart regeneration1,2. Here, to explore whether metabolic reprogramming can overcome this barrier and enable heart regeneration, we abrogate fatty acid oxidation in cardiomyocytes by inactivation of Cpt1b.

Analysis of RNA-sequencing (RNA-seq) datasets revealed reduced expression levels of several key genes of glycolysis and cell cycle progression in the course of CM maturation during the first week after birth, whereas genes related to fatty acid oxidation (FAO) and the Krebs cycle were upregulated (Extended Data Fig. 1a). FAO-related genes that were upregulated included the muscle-specific isoform of carnitine palmitoyltransferase Cpt1b but not the ubiquitously expressed Cpt1a isoform, which are required for mitochondrial uptake of fatty acids and subsequent FAO (Extended Data Fig. 1b).

Motivation: To investigate whether Cpt1b inactivation and the ensuing proliferation of CMs enables heart regeneration, we subjected Cpt1bcKO and Cpt1biKO mice to ischaemia–reperfusion (I–R) injury, a model that closely mimics the situation in human patients receiving a stent for revascularization of an obstructed coronary artery. I–R-induced scars were virtually absent in Cpt1bcKO and Cpt1biKO mice after 3 weeks compared to control animals, although the area at risk (AAR) was similar in both mutant hearts 24 h after I–R surgery (Fig. 2a–f, Extended Data Fig. 5a,b and Supplementary Information).

Setup

Import and Wrangle

Code
data_from <- "Inhibition of fatty acid oxidation enables heart regeneration in adult mice"
file_name <- "41586_2023_6585_MOESM6_ESM.xlsx"
file_path <- here(data_folder, data_from, file_name)

genotype_levels <- c("Ctrl", "Cpt1biKO")
fig2d <- read_excel(file_path,
                    sheet = "Fig2.d",
                    range = "B1:C10") |>
  data.table() |>
  melt(measure.vars = genotype_levels,
       variable.name = "genotype",
       value.name = "fibrotic_area") |>
  na.omit()
fig2d[, genotype := factor(genotype, levels = genotype_levels)]

fig2d_percent <- read_excel(file_path,
                    sheet = "Fig2.d",
                    range = "B12:C21") |>
  data.table() |>
  melt(measure.vars = genotype_levels,
       variable.name = "genotype",
       value.name = "percent") |>
  na.omit()
fig2d[, fibrotic_area_percent := fig2d_percent[, percent]]
fig2d[, fibrotic_area_proportion := fibrotic_area_percent/100]
fig2d[, heart_area := fibrotic_area/fibrotic_area_proportion]


fileout_name <- "Fig2d-proportion - Inhibition of fatty acid oxidation enables heart regeneration in adult mice.xlsx"
fileout_path <- here(data_folder, data_from, fileout_name)
write_xlsx(fig2d, fileout_path)

Reproducibility – linear model / t-test

Code
lm1 <- lm(fibrotic_area ~ genotype, data = fig2d)
lm1_emm <- emmeans(lm1, specs = "genotype")
lm1_pairs <- contrast(lm1_emm, method = "revpairwise") |>
  summary(infer = TRUE)

gg1 <- plot_response(lm1, lm1_emm, lm1_pairs,
                     y_label = "Fibrotic Area",
                     palette = "pal_okabe_ito_blue")
  
lm2 <- lm(fibrotic_area_percent ~ genotype, data = fig2d)
lm2_emm <- emmeans(lm2, specs = "genotype")
lm2_pairs <- contrast(lm2_emm, method = "revpairwise") |>
  summary(infer = TRUE)

gg2 <- plot_response(lm2, lm2_emm, lm2_pairs,
                     y_label = "Percent Fibrotic Area",
                     palette = "pal_okabe_ito_blue")

# contrast_table <- rbind(lm1_pairs, lm2_pairs)
# contrast_table |>
#   kable(digits = 4) |>
#   kable_styling() |>
#   pack_rows("linear model/t-test of fibrotic area", 1, 1) |>
#   pack_rows("linear model/t-test of fibrotic area as percent of total area", 2, 2)

plot_grid(gg1, gg2, ncol = 2)

Notes on what the researchers did

  1. The researchers report a t-test of fibrotic area (left panel above). The effect size (the difference in mean fibrotic area between genotypes) is confounded by differences in heart size between genotypes. So this comparison is not very meaninful as is.
  2. The researchers report a t-test of relative fibrotic area (the percent of total heart area that is fibrotic)(right panel above). Relative fibrotic area is a meaningful response but presenting this as a difference is weird as we move from near zero to near 0.5 to near 1.
  3. look what the researcher’s t-tests implies about inference. Both t-tests imply the 95% CIs shown in the plots. These CIs imply mean fibrotic areas that are less then zero are consistent with the data!

A GLM with fibrotic area using log(heart area) as an offset variable combines the two analyses above into a single analysis!

A common way to test for the effect of a treatment when the effect is confounded by a covariate is ANCOVA. For example, many researchers test for the effect of a treatment on O2 consumption and use body weight as a covariate, since O2 consupmption increases with body weight. A linear model with an offset is exactly like an a linear model with a continuous covariate (ANCOVA model) except the coefficient (slope) of the covariate is forced to be 1.

Here is a covariate (ANCOVA) model:

\[ \texttt{log}(\texttt{fibrotic area}) = b_0 + b_1 \texttt{log}(\texttt{heart area}) + b_2 \texttt{genotype} \]

where \(b_1\) is the slope of fibrotic area on heart area (note that the model does not contain the interaction of \(\texttt{genotype}\) and \(\texttt{heart area}\)). If we don’t expect larger hearts to have relatively more (or less) fibrotic area, then we would expect \(b_1 = 1\). An offset forces this expectation.

Importantly, if we use a GLM with log link, the model only models the log-transformed group means – not the means of the log transformed data, as in the ANCOVA model above.

Code
m3 <- glm(fibrotic_area ~ genotype + offset(log(heart_area)),
          family = Gamma(link = "log"),
          data = fig2d)
m3_emm <- emmeans(m3, specs = "genotype", type = "response")
m3_pairs <- contrast(m3_emm, method = "revpairwise") |>
  summary(infer = TRUE)

What is the effect in the GLM model with offset?

By forcing the coefficient to be one, the effect (0.23) is the ratio of the mean proportion of the KO group to the mean proportion of the Cn group, even though we used the raw fibrotic area (and not the proportion) as the response variable! This is the beauty of the offset model!

Code
m3_pairs |>
  kable(digits = c(1,7,3,1,2,2,1,2,6)) |>
  kable_styling(full_width = FALSE)
contrast ratio SE df lower.CL upper.CL null t.ratio p.value
Cpt1biKO / Ctrl 0.2312978 0.075 14 0.12 0.46 1 -4.51 0.000486

Here is the ratio computed manually

Code
mean(fig2d[genotype == "Cpt1biKO", fibrotic_area_proportion])/mean(fig2d[genotype == "Ctrl", fibrotic_area_proportion])
[1] 0.2312978

Plot the model! With the percent data!

While fibrotic area was the response variable in our statistical model, we want to plot the model with the proportion data. This means we need to convert the estimated model means and CIs to proportions. This is easy, simply divide these values by the mean heart area, ignoring genotype levels.

Code
m3_emm_dt <- m3_emm |>
  summary() |>
  data.table()
m3_pairs_dt <- data.table(m3_pairs)

common_factor <- mean(fig2d$heart_area)
m3_emm_dt[, mean := response/common_factor * 100]
m3_emm_dt[, lo := lower.CL/common_factor * 100]
m3_emm_dt[, hi := upper.CL/common_factor * 100]

Here is the percent fibrotic area computed from the model.

Code
m3_emm_dt[, .SD, .SDcols = c("genotype", "mean")]
   genotype     mean
     <fctr>    <num>
1:     Ctrl 8.127159
2: Cpt1biKO 1.879794

And here is the percent fibrotic area computed by simply averaging the percent data.

Code
fig2d[, .(fibrotic_area_percent = mean(fibrotic_area_percent)),
                                by = .(genotype)]
   genotype fibrotic_area_percent
     <fctr>                 <num>
1:     Ctrl              8.127159
2: Cpt1biKO              1.879794
Code
gg <- ggplot(data = fig2d,
             aes(x = genotype,
                 y = fibrotic_area_percent,
                 color = genotype)) +
  geom_sina(size = 2,
            maxwidth = 0.5,
            show.legend = FALSE) +
  geom_errorbar(data = m3_emm_dt,
                aes(x = genotype,
                    y = mean,
                    ymin = lo,
                    ymax = hi),
                width = 0.05,
                color = "black") +
  geom_point(data = m3_emm_dt,
             aes(x = genotype,
                 y = mean),
             size = 3,
             color = "black") +
  ylab("Fibrotic Area (% of Heart Area)") +
  scale_color_manual(values = pal_okabe_ito_2) +
  theme_pubr() +
  theme(axis.title.x = element_blank()) +
  NULL

  # add p-values
m3_pairs_dt[, group1 := "Ctrl"]
m3_pairs_dt[, group2 := "Cpt1biKO"]
m3_pairs_dt[, p := p.value |>
              p_round(digits = 2) |>
              p_format(digits = 2, accuracy = 1e-03, add.p = TRUE)]
maxy <- fig2d[, max(fibrotic_area_percent)]
miny <- fig2d[, min(fibrotic_area_percent)]
m3_pairs_dt[, y.position := maxy + 0.05*(maxy - miny)]

gg <- gg +
  stat_pvalue_manual(
    data = m3_pairs_dt,
    label = "p",
    tip.length = 0.001)

gg

Code
save_it <- TRUE
if(save_it){
out_fig <- "fig2d_ggplot.png"
out_path <- here("figs", data_from, out_fig)
ggsave(out_path)
}

This is beautiful. We don’t need two plots. The GLM with offset estimates the effect of genotype on Fibrotic Area adjusting for differences in total Heart Area. And we’ve plotted the model using the percent fibrotic area data and estimated means and CIs from the model.

But is the GLM with offset any good? – A simulation comparing it to what the researchers did, or might have done.

Code
get_clean_proportion_data <- function(
    n_sample,
    n_cols,
    alpha_num,
    beta_num,
    alpha_denom,
    beta_denom,
    rho_obs,
    seed_starter){
  rho_matrix = matrix(c(1, r_obs, r_obs, 1), nrow = 2, ncol = 2)
  params_num <- calc_theory(Dist = "Gamma", params = c(alpha_num, beta_num)) # fibrotic area
  params_denom <- calc_theory(Dist = "Gamma", params = c(alpha_denom, beta_denom)) # heart area
  n_cols_extra <- floor(n_cols * 1.2)
  fake_y <- rcorrvar(n = n_sample * n_cols_extra,
                     k_cont = 2,
                     method = "Polynomial",
                     means = c(params_num[1], params_denom[1]), # mean
                     vars = c(params_num[2]^2, params_denom[2]^2),
                     skews = c(params_num[3], params_denom[3]),
                     skurts = c(params_num[4], params_denom[4]),
                     fifths = c(params_num[5], params_denom[5]),
                     sixths = c(params_num[6], params_denom[6]),
                     rho = rho_matrix,
                     seed = seed_starter)$continuous_variables
  exc <- which(fake_y[, "V1"] < 0)
  if(length(exc > 0)){
    fake_y <- fake_y[-exc, ]}
  fake_y <- fake_y[1:(n_sample * n_cols), ]
  num_mat <- matrix(fake_y[,"V1"], nrow = n_sample, ncol = n_cols)
  denom_mat <- matrix(fake_y[,"V2"], nrow = n_sample, ncol = n_cols)
  prop_mat <- num_mat/denom_mat
  
  return(list(
    numerator = num_mat,
    denominator = denom_mat,
    proportion = prop_mat
  ))
}
Code
# get parameters
do_it <- FALSE
if(do_it){
  iter_sets <- 4
  n_iter <- 2500
  total_iter <- n_iter * iter_sets
  # parameters
  # sample size
  fig2d_summary <- fig2d[, .(N = .N), by = .(genotype)]
  
  # correlation between fibrotic area and heart area
  fig2d[, fibrotic_area_resid := residuals(lm(fibrotic_area ~ genotype, data = fig2d))]
  fig2d[, heart_area_resid := residuals(lm(heart_area ~ genotype, data = fig2d))]
  r_obs <- cor(fig2d$fibrotic_area_resid, fig2d$heart_area_resid)
  # the correlation is low-ish: 0.367
  rho_matrix = matrix(c(1, r_obs, r_obs, 1), nrow = 2, ncol = 2)
  
  # gamma parameters
  # numerator -- fibrotic area
  glm1 <- glm(fibrotic_area ~ genotype,
              family = Gamma(link = "log"),
              data = fig2d)
  glm1_emm <- emmeans(glm1, specs = "genotype", type = "response") |>
    summary() |>
    data.table()
  glm1_mu <- glm1_emm[, response]
  # make effect smaller for power
  effect <- 0.5
  glm1_mu[2] <- glm1_mu[1]*effect
  glm1_alpha <- gamma.shape(glm1)[[1]]
  glm1_beta <- glm1_alpha/glm1_mu # = rate or 1/scale
  
  # denominator -- heart area
  glm2 <- glm(heart_area ~ genotype,
              family = Gamma(link = "log"),
              data = fig2d)
  glm2_emm <- emmeans(glm2, specs = "genotype", type = "response") |>
    summary() |>
    data.table()
  glm2_mu <- glm2_emm[, response]
  glm2_alpha <- gamma.shape(glm2)[[1]]
  glm2_beta <- glm2_alpha/glm2_mu # = rate or 1/scale
  
  num_params_cn <- calc_theory(Dist = "Gamma", params = c(glm1_alpha, glm1_beta[1])) # fibrotic area
  denom_params_cn <- calc_theory(Dist = "Gamma", params = c(glm2_alpha, glm2_beta[1])) # heart area
  num_params_ko <- calc_theory(Dist = "Gamma", params = c(glm1_alpha, glm1_beta[2])) # fibrotic area
  denom_params_ko <- calc_theory(Dist = "Gamma", params = c(glm2_alpha, glm2_beta[2])) # heart area
  

  fd <- data.table(
    genotype = rep(c("Ctrl", "Cpt1biKO"), fig2d_summary$N),
    fibrotic_area = as.numeric(NA),
    heart_area = as.numeric(NA)
  )
  
  power_matrix <- matrix(nrow = total_iter, ncol = 4)
  colnames(power_matrix) <- c("mwu","lm", "qb", "gamma")
  type1_matrix <- matrix(nrow = total_iter, ncol = 4)
  colnames(type1_matrix) <- c("mwu","lm", "qb", "gamma")

  seed_starter_sim <- 0
  for(iter_set in 1:iter_sets){
    # power
    # make effect size smaller so power < 1
    seed_starter_sim <- seed_starter_sim + 1
    fake_cn <- get_clean_proportion_data(
      n_sample = fig2d_summary[1, N],
      n_cols = n_iter,
      alpha_num = glm1_alpha,
      beta_num = glm1_beta[1],
      alpha_denom = glm2_alpha,
      beta_denom = glm2_beta[1],
      rho_obs = r_obs,
      seed_start = seed_starter_sim)

    seed_starter_sim <- seed_starter_sim + 1
    fake_ko <- get_clean_proportion_data(
      n_sample = fig2d_summary[2, N],
      n_cols = n_iter,
      alpha_num = glm1_alpha,
      beta_num = glm1_beta[2],
      alpha_denom = glm2_alpha,
      beta_denom = glm2_beta[2],
      rho_obs = r_obs,
      seed_start = seed_starter_sim)
    
    fibrotic_area_mat <- rbind(fake_cn[["numerator"]],
                               fake_ko[["numerator"]])
    heart_area_mat <- rbind(fake_cn[["denominator"]],
                            fake_ko[["denominator"]])
    fibrotic_prop_mat <- rbind(fake_cn[["proportion"]],
                               fake_ko[["proportion"]])
    
    for(iter in 1:n_iter){
      sim_i <- n_iter*(iter_set - 1) + iter
      # fake data
      fd[, fibrotic_area := fibrotic_area_mat[, iter]]
      fd[, heart_area := heart_area_mat[, iter]]
      fd[, fibrotic_prop := fibrotic_prop_mat[, iter]]
      
      # Mann Whitney
      power_matrix[sim_i, "mwu"] <- wilcox.test(fibrotic_prop ~ genotype, data = fd)$p.value
      
      # lm
      m_lm <- lm(fibrotic_prop ~ genotype, data = fd)
      power_matrix[sim_i, "lm"] <- coef(summary(m_lm))["genotypeCtrl", "Pr(>|t|)"]
      
      # glm bin
      m_qb <- glm(fibrotic_prop ~ genotype,
                  family = quasibinomial,
                  data = fd)
      power_matrix[sim_i, "qb"] <- coef(summary(m_qb))["genotypeCtrl", "Pr(>|t|)"]
      
      # glm qp
      m_gamma <- glm(fibrotic_area ~ genotype + offset(log(heart_area)),
                  family = Gamma(link = "log"),
                  data = fd)
      power_matrix[sim_i, "gamma"] <- coef(summary(m_gamma))["genotypeCtrl", "Pr(>|t|)"]
      
      #       do_plot <- FALSE
      #       if(do_plot){
      # m_gamma_emm <- emmeans(m_gamma, specs = "genotype", type = "response")
      # m_gamma_pairs <- contrast(m_gamma_emm, method = "revpairwise") |>
      #   summary(infer = TRUE)
      #        plot_response(m_qb, m_qb_emm, m_qb_pairs)
              # ggplot(data = fd,
              #        aes(x = genotype,
              #            y = fibrotic_prop,
              #            color = genotype)) +
              #   geom_point()
      #       }
      
    }
    
    # type I
    seed_starter_sim <- seed_starter_sim + 1
    fake_cn <- get_clean_proportion_data(
      n_sample = fig2d_summary[1, N],
      n_cols = n_iter,
      alpha_num = glm1_alpha,
      beta_num = glm1_beta[1],
      alpha_denom = glm2_alpha,
      beta_denom = glm2_beta[1],
      rho_obs = r_obs,
      seed_start = seed_starter_sim)

    seed_starter_sim <- seed_starter_sim + 1
    fake_ko <- get_clean_proportion_data(
      n_sample = fig2d_summary[2, N],
      n_cols = n_iter,
      alpha_num = glm1_alpha,
      beta_num = glm1_beta[1],
      alpha_denom = glm2_alpha,
      beta_denom = glm2_beta[1],
      rho_obs = r_obs,
      seed_start = seed_starter_sim)
    
    fibrotic_area_mat <- rbind(fake_cn[["numerator"]],
                               fake_ko[["numerator"]])
    heart_area_mat <- rbind(fake_cn[["denominator"]],
                            fake_ko[["denominator"]])
    fibrotic_prop_mat <- rbind(fake_cn[["proportion"]],
                               fake_ko[["proportion"]])
    
    for(iter in 1:n_iter){
      sim_i <- n_iter*(iter_set - 1) + iter
      # fake data
      fd[, fibrotic_area := fibrotic_area_mat[, iter]]
      fd[, heart_area := heart_area_mat[, iter]]
      fd[, fibrotic_prop := fibrotic_prop_mat[, iter]]
      
      # Mann Whitney
      type1_matrix[sim_i, "mwu"] <- wilcox.test(fibrotic_prop ~ genotype, data = fd)$p.value
      
      # lm
      m_lm <- lm(fibrotic_prop ~ genotype, data = fd)
      type1_matrix[sim_i, "lm"] <- coef(summary(m_lm))["genotypeCtrl", "Pr(>|t|)"]
      
      # glm bin
      m_qb <- glm(fibrotic_prop ~ genotype,
                  family = quasibinomial,
                  data = fd)
      type1_matrix[sim_i, "qb"] <- coef(summary(m_qb))["genotypeCtrl", "Pr(>|t|)"]
      
      # glm qp
      m_gamma <- glm(fibrotic_area ~ genotype + offset(log(heart_area)),
                  family = Gamma(link = "log"),
                  data = fd)
      type1_matrix[sim_i, "gamma"] <- coef(summary(m_gamma))["genotypeCtrl", "Pr(>|t|)"]

    }    
  }

  saveRDS(power_matrix, "power_matrix.Rds")
  saveRDS(type1_matrix, "type1_matrix.Rds")
}else{
  power_matrix <- readRDS("power_matrix.Rds")
  type1_matrix <- readRDS("type1_matrix.Rds")
}

pless <- function(x, alpha = 0.05){
  stat <- sum(x <= alpha)/length(x)
  return(stat)
}
Code
res <- data.table(
  Test = c("Mann-Whitney", "Linear Model/T-test", "Quasibinomial", "Gamma offset"),
  Power = apply(power_matrix, 2, pless, alpha = 0.05),
  "Type I" = apply(type1_matrix, 2, pless, alpha = 0.05)
)
res |>
  kable(digits = c(1,2,3)) |>
  kable_styling(full_width = FALSE)
Test Power Type I
Mann-Whitney 0.71 0.044
Linear Model/T-test 0.71 0.049
Quasibinomial 0.75 0.052
Gamma offset 0.84 0.061

For data that look like those in fig. 2d, the Gamma GLM with offset has 18% higher power than both the linear model/t-test and the Mann-Whitney U, but has a very slightly inflated Type I error.